Human and automatic speech recognition in the presence of speech-intrinsic variations

Author

Bernd T. Meyer

Abstract

Despite several decades of research, automatic speech recognition (ASR) still falls short of the performance achieved by human listeners. One of the major challenges in ASR is coping with the immense variability of spoken language, which can be categorized into extrinsic sources (e.g., additive noise) and intrinsic factors (such as speaking rate, style, effort, dialect, and accent). What can we learn from the biological blueprint, and which cues important in human speech recognition (HSR) should be considered to improve ASR performance? This thesis addresses these questions by comparing HSR and ASR performance and, based on the results, proposing an alternative method of feature extraction to improve ASR. The comparison is based on the Oldenburg Logatome Corpus, a database of simple nonsense words consisting of phoneme triplets that covers the intrinsic variations mentioned above. The man-machine gap in terms of the signal-to-noise ratio (SNR) was estimated to be 15 dB, i.e., the masking level in ASR has to be lowered by 15 dB to achieve the same performance as human listeners. The contributions to this gap could be attributed to the individual processing steps of the ASR system: the information loss caused by feature extraction corresponded to an SNR-equivalent loss of 10 dB, while suboptimal classification accounted for the remaining 5 dB of the overall gap. Moreover, the analysis of intrinsic variations showed that human listeners are superior to ASR systems in exploiting temporal cues. These findings motivated the use of spectro-temporal Gabor features in ASR, which were found to exhibit increased robustness against a wide range of noise types. In the presence of intrinsic variations of speech, Gabor features increase the overall performance for several factors (such as speaking effort and style), which suggests incorporating both spectro-temporal and temporal cues in future ASR systems.
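To make the idea of a spectro-temporal Gabor feature more concrete, the following is a minimal sketch, not the thesis implementation. It assumes a common textbook form of the filter (a cosine carrier under a Hann envelope, applied by 2D correlation to a channels-by-frames log spectrogram). The function names `hann`, `gabor_filter`, and `apply_filter`, the filter sizes, and the modulation frequencies are all hypothetical and chosen only for illustration.

```python
import math


def hann(length):
    """Symmetric Hann window of the given length (length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def gabor_filter(n_freq, n_time, omega_k, omega_n):
    """Real part of a 2D Gabor filter: cosine carrier under a Hann envelope.

    omega_k and omega_n are the spectral and temporal modulation
    frequencies in radians per channel and per frame. The values used
    below are illustrative assumptions, not taken from the thesis.
    No DC removal is applied in this simple version.
    """
    win_k, win_n = hann(n_freq), hann(n_time)
    k0, n0 = n_freq // 2, n_time // 2
    return [
        [win_k[k] * win_n[n]
         * math.cos(omega_k * (k - k0) + omega_n * (n - n0))
         for n in range(n_time)]
        for k in range(n_freq)
    ]


def apply_filter(spec, filt):
    """Correlate filt with spec (channels x frames), zero-padded at the
    borders so the output has the same size as spec."""
    n_ch, n_fr = len(spec), len(spec[0])
    fk, fn = len(filt), len(filt[0])
    k0, n0 = fk // 2, fn // 2
    out = [[0.0] * n_fr for _ in range(n_ch)]
    for c in range(n_ch):
        for t in range(n_fr):
            acc = 0.0
            for i in range(fk):
                cc = c + i - k0
                if not 0 <= cc < n_ch:
                    continue  # outside the spectrogram: treated as zero
                for j in range(fn):
                    tt = t + j - n0
                    if 0 <= tt < n_fr:
                        acc += filt[i][j] * spec[cc][tt]
            out[c][t] = acc
    return out


# Toy input: 8 channels, 32 frames, with a purely temporal modulation
# at the same rate the filter is tuned to (pi/4 radians per frame).
toy_spec = [[math.cos(math.pi / 4 * t) for t in range(32)] for _ in range(8)]
temporal_filter = gabor_filter(5, 9, 0.0, math.pi / 4)
features = apply_filter(toy_spec, temporal_filter)
```

Setting `omega_k` to zero gives a purely temporal filter and setting `omega_n` to zero gives a purely spectral one, while nonzero values for both give a filter tuned to diagonal spectro-temporal patterns. In a real front end, a bank of such filters would be applied to a log mel spectrogram computed from audio.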

Related resources

A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation

Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making man-machine interaction more natural. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...

Designing and implementing a system for Automatic recognition of Persian letters by Lip-reading using image processing methods

For many years, speech has been the most natural and efficient means of information exchange for human beings. With the advancement of technology and the prevalence of computer usage, researchers have turned to the design and production of speech recognition systems. Among these, lip-reading techniques face many challenges for speech recognition; one of the challenges b...

Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods

Speech recognition is a subfield of artificial intelligence that develops technologies to convert speech utterances into transcriptions. So far, various methods such as hidden Markov models and artificial neural networks have been used to develop speech recognition systems. In most of these systems, the speech signal frames are processed uniformly, while the information is not evenly distributed ...

Convolutional neural network with adaptive windows for speech recognition

Although speech recognition systems are widely used and their accuracy is continuously increasing, there is still a considerable gap between their accuracy and human recognition ability. This is partly due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...

A binaural microscopic model based on a modulation filter bank for predicting speech intelligibility in normal-hearing listeners

In this study, a binaural microscopic model for the prediction of speech intelligibility based on the modulation filter bank is introduced. So far, binaural models have used spectral criteria such as the STI and SII, or other analytical methods, to determine binaural intelligibility. In the proposed model, unlike all models of binaural intelligibility prediction, an automatic ...

Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions

Automatic recognition of emotional states from speech in noisy conditions has become an important research topic in emotional speech recognition in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its perfor...

Publication date: 2009
